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ABSTRACT 

Many of the current state-of-the-art Large Vocabulary Con¬ 
tinuous Speech Recognition Systems (LVCSR) are hybrids 
of neural networks and Hidden Markov Models (HMMs). 
Most of these systems contain separate components that deal 
with the acoustic modelling, language modelling and se¬ 
quence decoding. We investigate a more direct approach in 
which the HMM is replaced with a Recurrent Neural Net¬ 
work (RNN) that performs sequence prediction directly at the 
character level. Alignment between the input features and 
the desired character sequence is learned automatically by an 
attention mechanism built into the RNN. For each predicted 
character, the attention mechanism scans the input sequence 
and chooses relevant frames. We propose two methods to 
speed up this operation; limiting the scan to a subset of 
most promising frames and pooling over time the information 
contained in neighboring frames, thereby reducing source 
sequence length. Integrating an n-gram language model into 
the decoding process yields recognition accuracies similar to 
other HMM-free RNN-based approaches. 

Index Terms — neural networks, LVCSR, attention, 
speech recognition, ASR 

1. INTRODUCTION 

Deep neural networks have become popular acoustic mod¬ 
els for state-of-the-art large vocabulary speech recognition 
systems (Hinton et al., 2012a). However, in these systems 
most of the other components, such as Hidden Markov Mod¬ 
els (HMMs), Gaussian Mixture Models (GMMs) and n-gram 
language models, are the same as in their predecessors. These 
combinations of neural networks and statistical models are 
often referred to as hybrid systems. In a typical hybrid sys¬ 
tem, a deep neural network is trained to replace the Gaussian 
Mixture Model (GMM) emission distribution of an HMM by 
predicting for each input frame the most likely HMM state. 
These state labels are obtained from a trained GMM-HMM 
system that has been used to perform forced alignment. In 
other words, a two-stage training process is required, in which 
the older GMM approach is still used as a starting point. An 


obvious downside of this hybrid approach is that the acoustic 
model is not directly trained to minimize the final objective 
of interest. Our aim was to investigate neural LVCSR models 
that can be trained with a more direct approach by replacing 
the HMMs with a Attention-based Recurrent Sequence Gen¬ 
erators (ARSG) such that they can be trained end-to-end for 
sequence prediction. 

Recently, some work on end-to-end neural network 
LVCSR systems has shown promising results. A neural 
network model trained with Connectionist Temporal Clas¬ 
sification (CTC) (Graves et al., 2006) achieved promising 
results on the Wall Street Journal corpus (Graves and Jaitly, 
2014; Hannun et al., 2014b). A similar setup was used to 
obtain state-of-the-art results on the Switchboard task as well 
(Hannun et al., 2014a). Both of these models were trained to 
predict sequences of characters and were later combined with 
a word level language model. Furthermore, when the lan¬ 
guage model was implemented as a CTC-specific Weighted 
Finite State Transducer, decoding accuracies competitive 
with DNN-HMM hybrids were obtained (Miao et al., 2015). 

At the same time, a new direction of neural network re¬ 
search has emerged that deals with models that learn to focus 
their “attention” to specific parts of their input. Systems of 
this type have shown very promising results on a variety of 
tasks including machine translation (Bahdanau et al., 2015), 
caption generation (Xu et al., 2015), handwriting synthesis 
(Graves, 2013), visual object classification (Mnih et al., 2014) 
and phoneme recognition (Chorowski et al., 2014, 2015). 

In this work, we investigate the application of an Attention- 
based Recurrent Sequence Generator (ARSG) as a part of an 
end-to-end LVCSR system. We start from the system pro¬ 
posed in (Chorowski et al., 2015) and make the following 
contributions: 

1. We show how training on long sequences can be made 
feasible by limiting the area explored by the attention 
to a range of most promising locations. This reduces 
the total training complexity from quadratic to linear, 
largely solving the scalability issue of the approach. 
This has already been proposed (Chorowski et al., 
2015) under the name “windowing”, but was used only 


at the decoding stage in that work. 

2. In the spirit of he Clockwork RNN (Koutnik et ah, 

2014) and hierarchical gating RNN (Chung et ah, 

2015) , we introduce a recurrent architecture that suc¬ 
cessively reduces source sequence length by pooling 
frames neighboring in time. * 

3. We show how a character-level ARSG and n—gram 
word-level language model can be combined into a 
complete system using the Weighted Finite State Trans¬ 
ducers (WFST) framework. 

2. ATTENTION-BASED RECURRENT SEQUENCE 
GENERATORS FOR SPEECH 

The system we propose is a neural network that can map se¬ 
quences of speech frames to sequences of characters. While 
the whole system is differentiable and can be trained directly 
to perform the task at hand, it can still be divided into dif¬ 
ferent functional parts that work together to learn how to en¬ 
code the speech signal into a suitable feature representation 
and to decode this representation into a sequence of charac¬ 
ters. We used RNNs for both the encoder and decoder- parts 
of the system. The decoder combines an RNN and an atten¬ 
tion mechanism into an Attention-based Recurrent Sequence 
Generator that is able to learn the alignment between its in¬ 
put and its output. Therefore, we will hrst discuss RNNs, 
and subsequently, how they can be combined with attention 
mechanisms to perform sequence alignment. 

2.1. Recurrent Neural Networks 

There has been quite some research into Recurrent Neural 
Networks (RNNs) for speech recognition (Robinson et ah, 
1996; Lippmann, 1989) and this can probably be explained 
to a large extent by the elegant way in which they can deal 
with sequences of variable length. 

Given a sequence of feature vectors (xi, • • • , xt), a stan¬ 
dard RNN computes a corresponding sequence of hidden state 
vectors (hi, • • • , hr) using 

ht= g{Wa,h^t+'Whhiit-i+hh), (1) 

where W^h and yVkh are matrices of trainable parameters 
that represent the connection weights of the network and hh 
is a vector of trainable bias parameters. The function g{-) is 
often a non-linear squashing function like the hyperbolic tan¬ 
gent and applied element-wise to its input. The hidden states 
can be used as features that serve as inputs to a layer that per¬ 
forms a task like classification or regression. Given that this 

'This mechanism has been recently independently proposed in (Chan 
et al., 2015). 

^The word “decoder” refers to a network in this context, not to the final 
recognition algorithm. 


output layer and the objective to optimize are differentiable, 
the gradient of this objective with respect to the parameters of 
the network can be computed with backpropagation through 
time. Like feed-forward networks, RNNs can process discrete 
input data by representing it as 1-hot-coding feature vectors. 

An RNN can be used as a statistical model over sequences 
of labels. For that, it is trained it to predict the probability of 
the next label conditioned on the part of the sequence it has 
already processed. If (yi, • • • , yr) is a sequence of labels, an 
RNN can be trained to provide the conditional distribution the 
next label using 

p{yt\yi,-- ■ ,yt-i) =p{yt\iit) 

= softmax(W?,;ht -f b;), 

where W/i; is a matrix of trainable connection weights, b/ is 
a vector of bias parameters and softmaXi(a) = 

The likelihood of the complete sequence is now given by 
P{yi)Ul= 2 Piyt\yir ■ ■ ,yt-i)- This distribution can be used 
to generate sequences by either sampling from the distribu¬ 
tion p{yt\yl^ • • ■ , yt-i) or choosing the most likely labels it¬ 
eratively. 

Equation 1 dehnes the simplest RNN, however in prac¬ 
tice usually more advanced equations define the dependency 
of ht on h(_i. Famous examples of these so-called recur¬ 
rent transitions are Long Short Term Memory (Hochreiter and 
Schmidhuber, 1997) and Gated Recurrent Units (GRU) (Cho 
et al., 2014), which are both designed to better handle long¬ 
term dependencies. In this work we use GRU for it has a sim¬ 
pler architecture and is easier to implement efficiently. The 
hidden states hj are computed using the following equations: 

Zt = tT(Wa; 2 Xt -f Vhz^t-l), 

^ xr^t -f , 

ht = tanh -f 'Urh{rt <S> ht_i)), 

ht = (1 - zt)ht_i -1- ztht, 

where h( are candidate activations, z* and are update and 
reset gates respectively. The symbol 0 signihes element-wise 
multiplication. 

To obtain a model that uses information from both future 
frames and past frames, one can pass the input data through 
two recurrent neural networks that run in opposite directions 
and concatenate their hidden state vectors. Recurrent neu¬ 
ral network of this type are often referred to as bidirectional 
RNNs. 

Finally, it has been shown that better results for speech 
recognition tasks can be obtained by stacking multiple lay¬ 
ers of recurrent neural networks on top of each other (Graves 
et al., 2013). This can simply be done by treating the se¬ 
quence of state vectors (hi, • • • , hr) as the input sequence 
for the next RNN in the pile. Figure 1 shows an example of 
two bidirectional RNNs that have been stacked on top of each 
other to construct a deep architecture. 





Fig. 1. Two Bidirectional Recurrent Neural Networks stacked 
on top of each other. 



Fig. 2. A pooling over time BiRNN; the upper layer runs 
twice slower then the lower one. It can average, or subsample 
(as shown in the figure) the hidden states of the layer below 
it. 

2.2. Encoder-Decoder Architecture 

Many challenging tasks involve inputs and outputs which may 
have variable length. Examples are machine translation and 
speech recognition, where both input and output have variable 
length; and image caption generation, where the captions may 
have variable lengths. 

Encoder-decoder networks are often used to deal with 
variable length input and output sequences (Cho et al., 2014; 
Sutskever et al., 2014). The encoder is a network that trans¬ 
forms the input into an intermediate representation. The 
decoder is typically an RNN that uses this representation in 
order to generate the outputs sequences as described in 2.1. 

In this work, we use a deep BiRNN as an encoder. Thus, 
the representation is a sequence of BiRNN state vectors 
(hi,...,hL). Eor a standard deep BiRNN, the sequence 
(hi,..., Hl) is as long as the input of the bottom-most layer, 
which in the context of speech recongnition means one for 
every 10ms of the recordings. We found that for our decoder 


(see 2.3) such representation is overly precise and contains 
much redundant information. This led us to add pooling 
between BiRNN layers as illustrated by Eigure 2. 

2.3. Attention-equipped RNNs 

The decoder network in our system is an Attention-based Re¬ 
current Sequence Generator (ARSG). This subsection intro¬ 
duces ARSGs and explains the motivation behind our choice 
of an ARSG for this study. 

While RNNs can process and generate sequential data, the 
length of the sequence of hidden state vectors is always equal 
to the length of the input sequence. One can aim to learn 
the alignment between these two sequences to model a distri¬ 
bution p(t/i, • • • , j/Tlhi, • • • , hi,) for which there is no clear 
functional dependency between T and L. 

An ARSG produces an output sequence (j/i, • • • , ?/t) one 
element at a time, simultaneously aligning each generated el¬ 
ement to the encoded input sequence (hi, • • • , hi,). It is com¬ 
posed of an RNN and an additional subnetwork called ‘atten¬ 
tion mechanism’. The attention selects the temporal locations 
over the input sequence that should be used to update the hid¬ 
den state of the RNN and to make a prediction for the next 
output value. Typically, the selection of elements from the in¬ 
put sequence is a weighted sum Ct = where au are 

called the attention weights and we require that au > 0 and 
that Oiti = 1- See Eigure 3 for a schematic representation 
of an ARSG. 

The attention mechanism used in this work is an improved 
version of the hybrid attention with convolutional features 
from (Chorowski et al., 2015), which is described by the fol¬ 
lowing equations: 

F = Q * OLt-i (2) 
eu = tanh(Wst_i + Vh/ + Uf/ -f b) (3) 

au = exp(eti) / ^exp(et;) . (4) 

/ 1=1 

where W, V, U, Q are parameter matrices, w and b are pa¬ 
rameter vectors, * denotes convolution, St_i stands for the 
previous state of the RNN component of the ARSG. We ex¬ 
plain how it works starting from the end: (4) shows how the 
weights au are obtained by normalizing the scores eu- As il¬ 
lustrated by (3), the score depends on the previous state St_i, 
the content in the respective location h; and the vector of so- 
called convolutional features fj. The name “convolutional” 
comes from the convolution along the time axis used in (2) to 
compute the matrix F that comprises all feature vectors f;. 

Simply put, the attention mechanism described above 
combines information from three sources to decide where to 
focus at the step t: the decoding history contained in S(_i, 
the content in the candidate location h/ and the focus from 
the previous step described by attention weights Q:t_i. It is 
shown in (Chorowski et al., 2015) that making the attention 










































location-aware, that is using O-t-i in the equations dehning 
OLt, is crucial for reliable behaviour on long input sequences. 

A disadvantage of the approach from (Chorowski et ah, 
2015) is the complexity of the training procedure, which is 
0{LT) since weights ati have to be computed for all pairs 
of input and output positions. The same paper showcases a 
windowing approach that reduces the complexity of decod¬ 
ing to 0{L -h T). In this work we apply the windowing 
at the training stage as well. Namely, we constrain the at¬ 
tention mechanism to only consider positions from the range 
{rrit-i — wi,..., rrit-i + Wr), where mt-i is the median of 
Q-t-i, interpreted in this context as a distribution. The values 
wi and Wr dehne how much the window expands to the left 
and to the right respectively. This modification makes training 
signihcantly faster. 

Apart from the speedup it brings, windowing can be also 
very helpful for starting the training procedure. From our ex¬ 
perience, it becomes increasingly harder to train an ARSG 
completely from scratch on longer input sequences. We found 
that providing a very rough estimate of the desired align¬ 
ment at the early training stage is an effective way to quickly 
bring network parameters in a good range. Specihcally, we 
forced the network to only choose from positions in the range 
Rf — (-^mm ■ ■ ■ j ^max • The numbers Srain^ 

Smax, Vmin, Vmax vvere roughly estimated from the training 
set so that the number of leading silent frames for training ut¬ 
terances were between Smin and Smax and so that the speaker 
speed, i.e. the ratio between the transcript and the encoded 
input lengths, were between Vmin and Vmax- We aimed to 
make the windows Rt as narrow as possible, while keeping 
the invariant that the character yt was pronounced within the 
window Rt- The resulting sequence of windows is quickly 
expanding, but still it was sufficient to quickly move the net¬ 
work out of the random initial mode, in which it had often 
aligned all characters to a single location in the audio data. 
We note, that the median-centered windowing could not be 
used for this purpose, since it relies on the quality of the pre¬ 
vious alignment to dehne the window for the new one. 

3. INTEGRATION WITH A LANGUAGE MODEL 

Although an ARSG by construction implicitly learns how an 
output symbol depends on the previous ones, the transcrip¬ 
tions of the training utterances are typically insufficient to 
learn a high-quality language model. For this reason, we 
investigate how an ARSG can be integrated with a language 
model. The main challenge is that in speech recognition 
word-based language models are used, whereas our ARSG 
models a distribution over character sequences. 

We use the Weighted Finite State Transducer (WFST) 
framework (Mohri et ah, 2002; Allauzen et al., 2007) to build 
a character-level language model from a word-level one. A 
WFST is a hnite automaton, whose transitions have weight 
and input and output labels. It dehnes a cost of transducing 



Fig. 3. Schematic representation of the Attention-based Re¬ 
current Sequence Generator. At each time step t, an MLP 
combines the hidden state St_i with all the input vectors h; 
to compute the attention weights au- Subsequently, the new 
hidden state St and prediction for output label yt can be com¬ 
puted. 

an input sequence into an output sequence by considering 
all pathes with corresponding sequences of input and output 
labels. The composition operation can be used to combine 
FSTs that dehne different levels of representation, such as 
characters and words in our case. 

We compose the language model Finite State Transducer 
(FST) G with a lexicon FST L that simply spells out the 
letters of each word. More specihcally, we build an FST 
T — min(det(L o G)) to dehne the log-probability for char¬ 
acter sequences. We push the weights of this FST towards the 
starting state to help hypothesis pruning during decoding. 

For decoding we look for a transcript y that minimizes 
the cost L which combines the encoder-decoder (ED) and the 
language model (LM) outputs as follows: 

L = -\ogPED{.y\x) - PlogpLMiy) - iT (5) 

where /3 and 7 are tunable parameters. The last term is 
important, because without it the LM component dominates 
and the cost L is minimized by too short sequences. We note 
that the same criterion for decoding was proposed in (Hannun 
et al., 2014b) for a CTC network. 

Integrating an FST and an ARSG in a beam-search decod¬ 
ing is easy because they share the property that the current 
state depends only on the previous one and the input symbol. 
Therefore one can use a simple left-to-right beam search algo¬ 
rithm similar to the one described in (Sutskever et al., 2014) 
to approximate the value of y that minimizes L. 

The determinization of the FST becomes impractical for 
moderately large FSTs, such as the trigram model shipped 
with the Wall Street Journal corpus (see Subsection 5.1). To 
handle non-deterministic FSTs we assume that its weights 




































are in the logarithmic semiring and compute the total log- 
probability of all FST paths corresponding to a character pre¬ 
fix from the beam. This probability can be quickly recom¬ 
puted when a new character is added to the prefix. 

4. RELATED WORK 

A popular method to train networks to perform sequence 
prediction is Connectionist Temporal Classification (Graves 
et al., 2006). It has been used with great success for both 
phoneme recognition (Graves et al., 2013) and character- 
based LVCSR (Graves and Jaitly, 2014; Hannun et al., 
2014b, a; Miao et al., 2015). CTC allows recurrent neural 
networks to predict sequences that are shorter than the input 
sequence by summing over all possible alignments between 
the output sequence and the input of the CTC module. This 
summation is done using dynamic programming in a way 
that is similar to the forward and backward passes that are 
used to do inference in an HMM. In the CTC approach, out¬ 
put labels are conditionally independent given the alignment 
and the output sequences. In the context of speech recogni¬ 
tion, this means that a CTC network lacks a language model, 
which greatly boosts the system performance when added to 
a trained CTC network (Hannun et al., 2014b; Miao et al., 
2015). 

An extension of CTC is the RNN Transducer which com¬ 
bines two RNNs into a sequence transduction system (Graves, 
2012; Boulanger-Lewandowski et al., 2013). One network is 
similar to a CTC network and runs at the same time-scale as 
the input sequence, while a separate RNN models the prob¬ 
ability of the next label output label conditioned on the pre¬ 
vious ones. Like in CTC, inference is done with a dynamic 
programming method similar to the backward-forward algo¬ 
rithm for HMMs, but taking into account the constraints de¬ 
fined by both of the RNNs. Unlike CTC, RNN transduction 
systems can also generate output sequences that are longer 
than the input. RNN Transducers have led to state-of-the-art 
results in phoneme recognition on the TIMIT dataset (Graves 
et al., 2013) which were recently matched by an ASRG net¬ 
work (Chorowski et al., 2015). 

The RNN Transducer and ARSG approaches are roughly 
equivalent in their capabilities. In both approaches an implicit 
language model is learnt jointly with the rest of the network. 
The main difference between the approaches is that in ARSG 
the alignment is explicitly computed by the network, as op¬ 
posed to dealing with a distribution of alignments in the RNN 
Transducer. We hypothesize that this difference might have a 
major impact on the further development of these methods. 

Finally, we must mention two very recently published 
works that partially overlap with the content of this paper. 
In (Chan et al., 2015) Encoder-Decoder for character-based 
recognition, with the model being quite similar to ours. In 
particular, in this work pooling between the BiRNN layers 
is also proposed. Also, in (Miao et al., 2015) using FSTs 


to build a character-level model from an n-gram model is 
advocated. We note, that the research described in this paper 
was carried independently and without communication with 
the authors of both aforementioned works. 

5. EXPERIMENTS 

5.1. Data 

We trained and evaluated our models ^ on the Wall Street 
Journal (WSJ) corpus (available at the Linguistic Data Con¬ 
sortium as LDC93S6B and LDC94S13B). Training was done 
on the 81 hour long SI-284 set of about 37K sentences. As 
input features, we used 40 mel-scale filterbank coefficients 
together with the energy. These 41 dimensional features were 
extended with their first and second order temporal deriva¬ 
tives to obtain a total of 123 feature values per frame. Evalu¬ 
ation was done on the “eval92” evaluation set. Hyperparame¬ 
ter selection was performed on the “dev93” set. Eor language 
model integration we have used the 20K closed vocabulary 
setup and the bigram and trigram language model that were 
provided with the data set. We use the same text preprocess¬ 
ing as in (Hannun et al., 2014b), leaving only 32 distinct la¬ 
bels; 26 characters, apostrophe, period, dash, space, noise and 
end-of-sequence tokens. 

5.2. Training 

Our model used 4 layers of 250 forward + 250 backward GRU 
units in the encoder, with the top two layers reading every 
second of hidden states of the network below it (see Eigure 
2). Therefore, the encoder reduced the utterance length by 
the factor of 4. A centered convolution filter of width 200 
was used in the attention mechanism to extract a single feature 
from the previous step alignment as described in (4). 

The AdaDelta algorithm (Zeiler, 2012) with gradient clip¬ 
ping was used for optimization. We initialized all the weights 
randomly from an isotropic Gaussian distribution with vari¬ 
ance 0.1. 

We used a rough estimate of the proper alignment for 
the first training epoch as described in Section 2.3. After 
that the training was restarted with the windowing described 
in the same section. The window parameters were = 
wr = 100, which corresponds to considering a large 8 sec¬ 
ond long span of audio data at each step, taking into account 
the pooling done between layers. Training with the AdaDelta 
hyperparameters p = 0.95, e = 10“® was continued until 
log-likelihood stopped improving. Einally, we annealed the 
best model in terms of CER by restarting the training with 
e = 10-1°. 

We found regularization necessary for the best perfor¬ 
mance. The column norm constraint 1 was imposed on all 
weight matrices (Hinton et al., 2012b). This corresponds to 

^Our code is available at https://github.com/rizar/attention-lvcsr 



constraining the norm of the weights of all the connections 
incoming to a unit. 

5.3. Decoding and Evaluation 

As explained in Section 3, we used beam search to minimize 
the combined cost L dehned by (5). We hnished when k ter¬ 
minated sequences cheaper than any non-terminated sequence 
in the beam were found. A sequence was considered termi¬ 
nated when it ended with the special end-of-sequence token, 
which the network was trained to generate in the end of each 
transcript. 

To measure the best performance we used the beam size 
200 , however this brought us only « 10 % relative improve¬ 
ment over beam size 10. We used parameter settings a = 0.5 
and 7=1 with a language model and 7 = 0.1 without one. 
It was necessary to use an asymmetric window for the atten¬ 
tion when decoding with large 7 . More specihcally, we re¬ 
duced wl to 10. Without this trick, the cost L could be in- 
hnitely minimized by looping across the input utterance, for 
the penalty for jumping back in time included in logp(i/|a;) 
was not high enough. 

5.4. Results 

Results of our experiments are gathered in Table 5.4. Our 
model shows performance superior to that of CTC systems 
when no external language model is used. The improvement 
from adding an external language model is however much 
larger for CTC-based systems. The hnal peformance of our 
model is better than the one reported in (Hannun et al., 2014b) 
(13.0% vs 14.1%), but worse than the the one from (Miao 
et al., 2015) (11.3% vs 9.0%) when the same language mod¬ 
els are used. 

6. DISCUSSION 

A major difference between the CTC and ARSG approaches 
is that a language model is implicitly learnt in the latter. In¬ 
deed, one can see that an RNN sequence model as explained 
in 2.1 is literally contained in an ARSG as a subnetwork. We 
believe that this is the reason for the greater performance of 
the ARSG-based system when no external LM is used. How¬ 
ever, this implicit language model was trained on a relatively 
small corpus of WSJ transcripts containing less than 4 million 
characters. It has been reported that RNNs overfit on corpora 
of such size (Graves, 2013) and in our experiments we had to 
combat overhtting as well. Using the weight clipping brought 
a consistent performance gain but did not change the big pic¬ 
ture. For these reasons, we hypothesize that overhtting of the 
internal RNN language model was one of the main reasons 
why our model did not reach the performance level reported 
in (Miao et al., 2015), where a CTC network is used. 


Table I. Character Error Rate (CER) and Word Error Rate 
(WER) scores for our setup on the Wall Street Journal Corpus 
in comparison with other results from the literature. Note that 
our results are not directly comparable with those of networks 
predicting phonemes instead of characters, since phonemes 
are easier targets. 


Model 

CER 

WER 

Encoder-Decoder 

6.4 

18.6 

Encoder-Decoder + bigram LM 

5.3 

11.7 

Encoder-Decoder + trigram LM 

4.8 

10.8 

Encoder-Decoder + extended trigram LM 

3.9 

9.3 

Graves and Jaitly (2014) 

CTC 

9.2 

30.1 

CTC, expected transcription loss 

8.4 

27.3 

Hannun et al. (2014) 

CTC 

10.0 

35.8 

CTC + bigram LM 

5.7 

14.1 

Miao et al. (2015), 

CTC for phonemes + lexicon 

- 

26.9 

CTC for phonemes + trigram LM 

- 

7.3 

CTC + trigram LM 

- 

9.0 


That being said, we treat it as an advantage of the ARSG 
that it supports joint training of a language model with the 
rest of the network. Eor one, WSJ contains approximately 
only 80 hours of training data, and overhtting might be less 
of an issue for corpora containing hundreds or even thousands 
hours of annotated speech. Eor two, an RNN language model 
trained on a large text corpus could be integrated in an ARSG 
from the beginning of training by using the states of this lan¬ 
guage model as an additional input of the ARSG. We suppose 
that this would block the incentive of memorizing the training 
utterances, and thereby reduce the overhtting. In addition, no 
extra n-gram model would be required. We note that a similar 
idea has been already proposed in (Gulcehre et al., 2015) for 
machine translation. 

Einally, trainable integration with an n-gram language 
model could also be investigated. 

6.1. Conclusion 

In this work we showed how an Encoder-Decoder network 
with an attention mechanism can be used to build a LVCSR 
system. The resulting approach is signihcantly simpler than 
the dominating HMM-DNN one, with fewer training stages, 
much less auxiliary data and less domain expertise involved. 
Combined with a trigram language model our system shows 
decent, although not yet state-of-the-art performance. 

We present two methods to improve the computational 
complexity of the investigated model. Eirst, we propose pool¬ 
ing over time between BiRNN layers to reduce the length of 
the encoded input sequence. Second, we propose to use win¬ 
dowing during training to ensure that the decoder network 






performs a constant number of operations for each output 
character. Together these two methods facilitate application 
of attention-based models to large-scale speech recognition. 

Unlike CTC networks, our model has an intrinsic language¬ 
modeling capability. Furthermore, it has a potential to be 
trained jointly with an external language model. Investiga¬ 
tions in this direction are likely to be a part of our future 
work. 
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